NSF PAR Search | NSF Public Access Repository

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

Using consensus-based reasoning and large language models to extract structured data from surgical pathology reports

https://doi.org/10.1016/j.labinv.2025.104272

Tripathi, Aakash; Waqas, Asim; Venkatesan, Kavya; Ullah, Ehsan; Khan, Asma; Khalil, Farah; Chen, Wei-Shen; Ozturk, Zarifa Gahramanli; Saeed-Vafa, Daryoush; Bui, Marilyn M; et al (December 2025, Laboratory Investigation)

Free, publicly-accessible full text available December 1, 2026
Employing Consensus-Based Reasoning with Locally Deployed LLMs for Enabling Structured Data Extraction from Surgical Pathology Reports

https://doi.org/10.1101/2025.04.22.25326217

Tripathi, Aaksh; Waqas, Asim; Venkatesan, Kavya; Ullah, Ehsan; Khan, Asma; Khalil, Farah; Chen, Wei-Shen; Ozturk, Zarifa Gahramanli; Saeed-Vafa, Daryoush; Bui, Marilyn M; et al (April 2025, medRxiv)

Surgical pathology reports contain essential diagnostic information, in free-text form, required for cancer staging, treatment planning, and cancer registry documentation. However, their unstructured nature and variability across tumor types and institutions pose challenges for automated data extraction. We present a consensus-driven, reasoning-based framework that uses multiple locally deployed large language models (LLMs) to extract six key diagnostic variables: site, laterality, histology, stage, grade, and behavior. Each LLM produces structured outputs with accompanying justifications, which are evaluated for accuracy and coherence by a separate reasoning model. Final consensus values are determined through aggregation, and expert validation is conducted by board-certified or equivalent pathologists. The framework was applied to over 4,000 pathology reports from The Cancer Genome Atlas (TCGA) and Moffitt Cancer Center. Expert review confirmed high agreement in the TCGA dataset for behavior (100.0%), histology (98.5%), site (95.2%), and grade (95.6%), with lower performance for stage (87.6%) and laterality (84.8%). In the pathology reports from Moffitt (brain, breast, and lung), accuracy remained high across variables, with histology (95.6%), behavior (98.3%), and stage (92.4%), achieving strong agreement. However, certain challenges emerged, such as inconsistent mention of sentinel lymph node details or anatomical ambiguity in biopsy site interpretations. Statistical analyses revealed significant main effects of model type, variable, and organ system, as well as model × variable × organ interactions, emphasizing the role of clinical context in model performance. These results highlight the importance of stratified, multi-organ evaluation frameworks in LLM benchmarking for clinical applications. Textual justifications enhanced interpretability and enabled human reviewers to audit model outputs. Overall, this consensus-based approach demonstrates that locally deployed LLMs can provide a transparent, accurate, and auditable solution for integrating AI-driven data extraction into real-world pathology workflows, including cancer registry abstraction and synoptic reporting.
more » « less
Free, publicly-accessible full text available April 25, 2026
Reasoning Beyond Accuracy: Expert Evaluation of Large Language Models in Diagnostic Pathology

https://doi.org/10.1101/2025.04.11.25325686

Waqas, Asim; Ullah, Ehsan; Khan, Asma; Khalil, Farah; Ozturk, Zarifa Gahramanli; Chumbalkar, Vaibhav; Saeed-Vafa, Daryoush; Jameel, Zena; Chen, Weishen; Trejo_Bittar, Humberto; et al (April 2025, medRxiv)

Abstract BackgroundDiagnostic pathology depends on complex, structured reasoning to interpret clinical, histologic, and molecular data. Replicating this cognitive process algorithmically remains a significant challenge. As large language models (LLMs) gain traction in medicine, it is critical to determine whether they have clinical utility by providing reasoning in highly specialized domains such as pathology. MethodsWe evaluated the performance of four reasoning LLMs (OpenAI o1, OpenAI o3-mini, Gemini 2.0 Flash Thinking Experimental, and DeepSeek-R1 671B) on 15 board-style open-ended pathology questions. Responses were independently reviewed by 11 pathologists using a structured framework that assessed language quality (accuracy, relevance, coherence, depth, and conciseness) and seven diagnostic reasoning strategies. Scores were normalized and aggregated for analysis. We also evaluated inter-observer agreement to assess scoring consistency. Model comparisons were conducted using one-way ANOVA and Tukey’s Honestly Significant Difference (HSD) test. ResultsGemini and DeepSeek significantly outperformed OpenAI o1 and OpenAI o3-mini in overall reasoning quality (p < 0.05), particularly in analytical depth and coherence. While all models achieved comparable accuracy, only Gemini and DeepSeek consistently applied expert-like reasoning strategies, including algorithmic, inductive, and Bayesian approaches. Performance varied by reasoning type: models performed best in algorithmic and deductive reasoning and poorest in heuristic and pattern recognition. Inter-observer agreement was highest for Gemini (p < 0.05), indicating greater consistency and interpretability. Models with more in-depth reasoning (Gemini and DeepSeek) were generally less concise. ConclusionAdvanced LLMs such as Gemini and DeepSeek can approximate aspects of expert-level diagnostic reasoning in pathology, particularly in algorithmic and structured approaches. However, limitations persist in contextual reasoning, heuristic decision-making, and consistency across questions. Addressing these gaps, along with trade-offs between depth and conciseness, will be essential for the safe and effective integration of AI tools into clinical pathology workflows.
more » « less
Free, publicly-accessible full text available April 12, 2026

Search for: All records